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UPS DELIVERS OPTIMAL PHASE DIAGRAM IN 
HIGH-DIMENSIONAL VARIABLE SELECTION 
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Cornell University and Carnegie Mellon University 

Consider a linear model Y = Xf3 + z, z ~ N(Q,I n ). Here, X — X n , v , 
where both p and n are large, but p> n. We model the rows of X as i.i.d. 
samples from N(0, ^fJ), where is a p x p correlation matrix, which is 
unknown to us but is presumably sparse. The vector p is also unknown but 
has relatively few nonzero coordinates, and we are interested in identifying 
these nonzeros. 

We propose the Univariate Penalization Screeing (UPS) for variable se- 
lection. This is a screen and clean method where we screen with univariate 
thresholding and clean with penalized MLE. It has two important proper- 
ties: sure screening and separable after screening. These properties enable 
us to reduce the original regression problem to many small-size regression 
problems that can be fitted separately. The UPS is effective both in theory 
and in computation. 

We measure the performance of a procedure by the Hamming distance, 
and use an asymptotic framework where p — > oo and other quantities (e.g., n, 
sparsity level and strength of signals) are linked to p by fixed parameters. 
We find that in many cases, the UPS achieves the optimal rate of conver- 
gence. Also, for many different f2, there is a common three-phase diagram 
in the two-dimensional phase space quantifying the signal sparsity and sig- 
nal strength. In the first phase, it is possible to recover all signals. In the 
second phase, it is possible to recover most of the signals, but not all of 
them. In the third phase, successful variable selection is impossible. UPS 
partitions the phase space in the same way that the optimal procedures do, 
and recovers most of the signals as long as successful variable selection is 
possible. 

The lasso and the subset selection are well-known approaches to variable 
selection. However, somewhat surprisingly, there are regions in the phase 
space where neither of them is rate optimal, even in very simple settings, 
such as Q is tridiagonal, and when the tuning parameter is ideally set. 
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1. Introduction. Consider the following sequence of regression problems: 

(1.1) Y^ = X^^ + z { p\ «W~JV(0,J n ), n = n p . 

Here, is an n p x p matrix, where both p and n p are large, but p > n p . 
The pxl vector is unknown to us, but is sparse in the sense that it has s p 
nonzeros where s p <C p. We are interested in variable selection: determining 
which components of j3^ are nonzero. For notational simplicity, we suppress 
the superscript ^ and subscript p whenever there is no confusion. 

A well-known approach to variable selection is subset selection, also known 
as the L°-penalization method (e.g., AIC [2], BIC [23] and RIC [13]). This 
approach selects variables by minimizing the following functional: 

1 (\ ss ) 2 
(1-2) ^ Y ~ X ^ + 2 mo > 

where A ss > is a tuning parameter, and || • \\ q denotes the L 9 -norm. The 
approach has good properties, but the optimization problem (1.2) is known 
to be NP hard, which prohibits the use of the approach when p is large. 

In the middle 1990s, Tibshirani [24] and Chen et al. [6] proposed a trail- 
breaking approach which is now known as the lasso or the basis pursuit. 
This approach selects variables by minimizing a similar functional, but ||/3||o 
is replaced by \\/3\\i. 

(1.3) ±||y-X/3||l + A lasso ||/3||i- 

A major advantage of the lasso is that (1.3) can be efficiently solved by the 
interior point method [6], even when p is relatively large. Additionally, in 
a series of papers (e.g., [9, 10]), it was shown that in the noiseless case (i.e., 
z = 0), the lasso solution is also the subset selection solution, provided that /3 
is sufficiently sparse. For these reasons, the lasso procedure is passionately 
embraced by statisticians, engineers, biologists and many others. 

With that being said, an obvious shortcoming of these methods is that 
the penalization term does not reflect the correlation structure in X, which 
prohibits the method from fully capturing the essence of the data (e.g., 
Zou [30]). However, this shortcoming is largely due to that these methods 
are one-stage procedures. This calls for a two-stage or multi-stage procedure. 

1.1. Screen and clean. An idea introduced in the 1960s, screen and clean, 
has seen a revival recently [12, 27]. This is a two-stage method, where, at the 
first stage, we remove as many irrelevant variables as possible while keeping 
all relevant ones. At the second stage, we reinvestigate the surviving variables 
in hope of removing all false positives. The screening stage has the following 
advantages, some of which are elaborated in the literature: 

• Dimension reduction. We remove many irrelevant variables, reducing the 
dimension from p to a much smaller number [12, 27]. 
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• Correlation complexity reduction. A variable may be correlated to many 
other variables, but few of which will survive the screening; it is only 
correlated with a few other surviving variables. 

• Computation complexity reduction. Under some conditions (e.g., Section 2), 
surviving variables can be grouped into many small units, each has a size 
< K, and correlation between units is weak. These units can be fitted 
separately, with computational cost < # of units x 2 K . 

Despite the perceptive vision and philosophical importance in these works 
[12, 27], substantial vagueness remains: How to screen? How to clean? Is 
screen and clean really better than the lasso and the subset selection? This 
is where the Univariate Penalization Screening (UPS) comes in. 

1.2. UPS. The UPS is a two-stage method which contains an U-step 
and a P-step. In the [/-step, we screen with univariate thresholding [9] (also 
known as marginal regression [15] and sure screening [12]). Fix a threshold 
t > 0, and let Xj be the jth column of X. We remove the jth variable from 
the regression model if and only if | (xj,Y)\ < t. The set of surviving indices 
is then W p (t) = U v (t- Y, X) = {j : \( Xj , Y)\ > t, 1 < j < p}. 

Despite its simplicity, the [/-step can be effective in many situations. The 
key insight is that U p (t) has the following important properties: 

• Sure Screening (SS). With overwhelming probability, U p (t) includes all 
but a negligible proportion of the signals (i.e., nonzero coordinates of (5). 
The terminology is slightly different from that in [12]. 

• Separable After Screening (SAS). Define a graph where {1, 2, . . . ,p} is the 
set of nodes, and nodes j and k are connected if and only if \(xj,Xk)\ is 
large (i.e., columns j and k are "significantly" correlated). The SAS prop- 
erty refers to as that with overwhelming probability, U p (t) splits into many 
disconnected small-size components [a component is a maximal connected 
subgraph of U p (t)}. 

We now explain how these properties pave the way for the P-step. Let 
%o = {h, ■ ■ ■ , ^k} and Jo = {ji, . . . be two subsets of {1, 2, . . . ,p}, 1 < K, 
L <p. We have the following definition. 

Definition 1.1. For any px 1 vector Y, Y x ° denotes the K x 1 vector 
such that Y x °(k) = Yi k , 1 < k < K. For any p x p matrix Q, Q 1 ""^ denotes 
the K x L matrix such that n x °' Jo {k,£) = tt(i k ,j e ), l<k<K,l<£<L. 

Note that the regression model is closely related to the model X'Y = 
X'Xfi + X'z. Restricting the attention to U = U p (t), we have 

(X'Yf = (X'Xf3) u + {X'zf = (X'X) U ' V f3 + {X'zf, 

where V = {l,2,...,p}. Three key observations are the following: (a) since 
z ~ N(0, I n ), (X'zf ~ N(0, (X'X) U ' U ), (b) by the sure screening property, 



4 



P. JI AND J. JIN 



(X'X) U ' V f3 » (X'X) U ' U /3 U and (c) by the SAS property, (X'X) U > U approx- 
imately equals a block diagonal matrix, where each block corresponds to 
a maximal connected subgraph contained in U p (t). As a result, the original 
regression problem reduces to many small-size regression problems that can 
be solved separately, each at a modest computational cost. 

In detail, fix two parameters A ups and u nps . Let Zq = {i±, 12, ■ ■ ■ , ir} C 
U p {t) be a component, and let [M be a K x 1 vector the coordinates of 
which are either or u ups . Write A = (X'X) Xo ' X ° for short. Let (m{Zq) = 
(m{Zq\ Y, X, t, A ups , u nps ,p) be the minimizer of the functional 

(1.4) \{{X'Y) X ° - Av)'A- l ({X'Y) x ° - Aim) + i(A ups ) 2 ||/i|| . 

Combining all such estimates across different components oiU p (t) gives the 
UPS estimator, denoted by /3 ups = f3 ups (Y,X; t, A ups , u ups ,p), 

u P s _ f (p,(Io))k, if J = ifc Glo for some X = «2, • • ■ ^k} CU p (t), 
■\0, tij$U p (t p ). 



The UPS uses three tuning parameters (t, A ups , u ups ). In many cases, the 
performance of the UPS is relatively insensitive to the choice of t, as long as 
it falls in a certain range. The parameter A ups has a similar role to those of 
the lasso and the subset selection, but there is a major difference: the former 
can be conveniently estimated using the data, whereas how to set the latter 
remains an open problem. See Section 2 for more discussion. 

We are now ready to answer the questions raised in the end of Section 1.1: 
UPS indeed has advantages over the lasso and the subset selection. In Sec- 
tions 1.3—1.7, we establish a theoretic framework and investigate these pro- 
cedures closely. The main finding is the following: for a wide range of design 
matrices X, the Hamming distance of the UPS achieves the optimal rate 
of convergence. In contrast, the lasso and the subset selection may be rate 
nonoptimal, even for very simple design matrices. 

1.3. Sparse signal model and universal lower bound. We model (3 by 

(1.5) /3/~ d '(l-e)^o + e^, < e< 1,1 < j <p, 

where vq is the point mass at 0, and ir is a distribution that has no mass at 0. 
We use p as the driving asymptotic parameter and allow (e, n) to depend 
on p. Fix < < 1 and recall that s p is the number of signals. We calibrate 

(1.6) e = £ p =p~® so that s p ~ pe p = p 1 ""® . 

For any variable selection procedure /3 = f3(Y\X), we measure the loss by 
the Hamming distance 

p 

h p 0,P\X) = h p 0,P;e p ,7r p ,n p \X)=E Ept i rp ^l(sgn(/%) ^sgn(^)) 
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where sgn(O) = 0. In the context of variable selection, the Hamming dis- 
tance is a natural choice for loss function. While the focus of this paper is 
on selection error where we use Lo-loss, the idea can be extended to the es- 
timation setting where we use Lg-loss (0 < q < oo), but we have to perform 
an additional step of least square fitting after the selection. 

Somewhat surprisingly, there is a lower bound for the Hamming distance 
that holds for all sample size n and design matrix X (and so "universal 
lower bound"). The following notation is frequently used in this paper. 



Definition 1.2. L p > is a multi-log(p) term which may change from 

p— >oo Lp 



occurrence to occurrence, such that for any fixed 5 > 0, lim^oo L p ■ p s = oo 



and lirrip-s.oo L p p =0. 

Now, fixing r > 0, we introduce 
(1.7) t p = T p (r) = y / 2rlogp 



and X p = X p (e p ,Tp) = ^-[log(-^ £ ) + ^-]. Let $ = 1 — $ be the survival func- 
tion of .ZV(0, 1). The following theorem is proved in [18]. 



Theorem 1.1 (Lower bound). Fix i? G (0,1), r > and a sufficiently 
large p. Let e p , s p and r p be as in (1.6) and (1-7), and suppose the support 
of Tip is contained in [— r p , 0) U (0, r p ] . For any fixed n and matrix X = X^ 
such that X'X has unit diagonals, h p (/3,(3\X) > s p ■ [(1 — £p)<&(Xp)/e p + 

H-rp-Xp)]. 

Note that as p — > oo , 



(1.8) L^$( Ap ) + $( Tp - X p ) > { ^ • P ~ 
£p 1(1 + 



(l + o(l)), r<$. 



It may seem counterintuitive that the lower bound does not depend on n, but 
this is due to the way we normalize X . In the case of orthogonal design [i.e., 
coordinates of X and i.i.d. from iV(0, 1/n)], the lower bound can be achieved 
by either the lasso or marginal regression [15]. Therefore, the orthogonal 
design is among the best in terms of the error rate. 

Theorem 1.1 says that if we have signals, and the maximal signal 
strength is slightly smaller than ^/2i? log(p), then the Hamming distance 
of any procedure cannot be substantially smaller than s p , and so success- 
ful variable selection is impossible. In the sections below, we focus on the 
case where the signal strength is larger than y^2??log(p), so that successful 
variable selection is possible. 

The universality of the lower bound hints it may not be tight for nonorthog- 
onal X. Fortunately, it turns out that in many interesting cases, the lower 
bound is tight. To facilitate the analysis, we invoke the random design model. 
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1.4. Random design, connection to Stein's normal means model. Write 
X = (x\,X2, • • • , Xp) = (X\,X2, • • • , X n )' . We model X{ as i.i.d. samples from 
a p-variate zero-mean Gaussian distribution, 

(1.9) Xi l ~N(o,^n\ 

The p x p matrix Q = Q( p ) is unknown, but for simplicity we assume it has 
unit diagonals. The normalizing constant 1/n is chosen so that the diagonals 
of the Gram matrix X'X are approximately 1. Fixing 6 £ (1 — 1), we let 

(1.10) n = n p = p e . 

Note that s p <C n p <C p as p — )• oo. For successful variable selection, it is 
almost necessary to have s p <C n p [9]. Also, denoting the distribution of X by 
F = F p , note that for any variable selection procedure, the overall Hamming 
distance is Hamm p (/3,/3) = Ep[hp(f3\X)]. 

Model (1.9) is called the random design model which may be found in the 
following application £11 C£IS ! 

• Compressive sensing. We are interested in a p-dimensional sparse vector j3. 
We measure n general linear combinations of /3 and then reconstruct it. For 
1 < i < n, choose a p x 1 coefficient vector Xi, and observe Y{ = X^/3 + Zj, 
where Zi ~ N(0,o~ 2 ) is noise. For computational and storage concerns, one 
usually chooses X^s as simple as possible. Popular choices of Xi include 
Gaussian design, Bernoulli design, circulant design, etc. [3, 9]. Model (1.9) 
belongs to Gaussian design. 

• Privacy-preserving data mining. The vector (3 may contain some confi- 
dential information (e.g., HIV-diagnosis results of a community) that we 
must protect. While we cannot release the whole vector, we must allow 
data mining to some extent, because, for example, the study is of pub- 
lic interest and is supported by federal funding. To compromise, we allow 
queries as follows. For each query, the database randomly generates apxl 
vector Xi, and releases both Xi and Yj = X[j3 + Zi to the querier, where 
Zi ~ ./V(0,<7 2 ) is a noise term. For privacy concerns, the number of allowed 
queries is much smaller than p. Popular choices of Xi include Gaussian 
design and Bernoulli design [8]. 

Random design model is closely related to Stein's normal means model 
W ~ N({3,T,), where S = Q" 1 . To see the point, recall that model (1.1) is 
closely related to the model X'Y = X'X/3 + X'z. Since the rows of X are i.i.d. 
samples from N(0, ^Q) and Sp<^n p <^ p, we expect to see that X'X (5 w 
and X'z w N(0, n), and so that X'Y w 7V(0/3, 0). Therefore, Stein's normal 
means model can be viewed as an idealized version of the random design 
model. This suggests that solving the variable selection problem opens doors 
for solving Stein's normal means problem, and vice versa. 
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1.5. Optimality of the UPS. The main results of this paper are Theo- 
rems 2.1 and 2.2 in Section 2. To state such results, we need relatively long 
preparations. Therefore, we sketch these results below, but leave the formal 
statements to later. In models (1.1), (1.5) and (1.9), let (s p ,T p ,n p ) be as 
in (1.6), (1.7) and (1.10). Suppose: 

• Each row of satisfies a certain summability condition, so it has relatively 
few large coordinates. 

• The support of ir p is contained in [r p , (1 + t])t p ], where t p = y / 2rlog(p), 
and r\ is a constant to be defined later. We suppose r > so that successful 
variable selection is possible; see Theorem 1.1. 

• Either all coordinates of fi are positive, or that r/fl < 3 + 2v2 (so that 
we won't have too many "signal cancellations" [27]). 

Fix < q < (0 + r) 2 /(4r), and set the tuning parameters (t, A ups , u ups ) by 

t; = t;(q) = y / 2q~hg~p~, A ups = A» ps = v^logb), u^ s = u» ps = Tp . 

The main result is that, as p — > oo, the ratio between the Hamming error 
of the UPS and s p is no grater than L p p~^~ r ^ /( 4r ). Comparing this with 
Theorem 1.1 gives that the lower bound is tight, and the UPS is rate optimal. 

1.6. Phase diagram for high- dimensional variable selection. The above 
results reveal a watershed phenomenon as follows. Suppose we have roughly 
s p = p l ~^ signals. If the maximal signal strength is slightly smaller than 
V / 201ogp, then the Hamming distance of any procedure cannot be substan- 
tially smaller than s p , hence successful variable selection is impossible. If the 
minimal signal strength is slightly larger than -y/20 logp, then there exist pro- 
cedures (UPS is one of them) whose Hamming distances are substantially 
smaller than s p , and they manage to recover most signals. 

The phenomenon is best described in the special case where tt p = v T is 
the point mass at t p , with t p = y/2r logp as in (1.7). If we call the two- 
dimensional domain {($, r) : < $ < 1, r > 0} the phase space, then the the- 
orems say that the phase space is partitioned into three regions: 

• Region of no recovery (0 < •& < 1, < r < 1?). In this region, the Ham- 
ming distance of any procedure > s p , and successful variable selection is 
impossible. 

• Region of almost full recovery [0 < < 1, < r < (1 + VT^&) 2 }. In this 
region, there are procedures (e.g., UPS) whose Hamming errors are much 
larger than 1, but are also much smaller than s p . In this region, it is 
possible to recover most of the signals, but not all of them. 

• Region of exact recovery [0 < < 1, r > (1 + VT^d) 2 ]. In this region, 
there are procedures (e.g., UPS) that recover all signals with probabil- 
ity » 1. 
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Fig. 1. Left: phase diagram. In the yellow region, the UPS recovers all signals with high 
probability. In the white region, it is possible (i.e., UPS) to recover almost all signals, but 
impossible to recover all of them. In the cyan region, successful variable selection is impossi- 
ble. Right: partition of the phase space by the lasso for the tridiagonal model (l.ll)-(l.lZ) 
(a — OA). The lasso is rate nonoptimal in the nonoptimal region. The region of exact 
recovery by the lasso is substantially smaller than that displayed on the left. 

See Figure 1 (left panel) for these regions. Note that the partitions are the 
same for many choices of fi. Because of the partition of the phases, we call 
this the phase diagram. The UPS is optimal in the sense that it partitions 
the phase space in exactly the same way as do the optimal procedures. 

The phase diagram provides a benchmark for variable selection. The lasso 
would be optimal if it partitions the phase space in the same way as in the 
left panel of Figure 1. Unfortunately, this is not the case, even for very 
simple f2. Below we investigate the case where X'X is a tridiagonal matrix, 
and identify precisely the regions where the lasso is rate optimal and where 
it is rate nonoptimal. More surprisingly, there is a region in the phase space 
where the subset selection is also rate nonoptimal. 

1.7. Nonoptimal region for the lasso. In Sections 1.7 and 1.8, we tem- 
porarily leave the random design model and consider Stein's normal means 
model, which is an idealized version of the former. Using an idealized ver- 
sion is mainly for mathematical convenience, but the gained insight is valid 
in much broader settings: if a procedure is nonoptimal in simple cases, we 
should not expect them to be optimal in more complicated cases. 

In this spirit, we consider Stein's normal means model 

(1.11) Y = X'Y ~JV(fi0,fi), 

where j3 is as in (1.5) with r p = u n and tt p = sj2r log(p). To further simplify 
the study, we fix a € (0, 1/2) and take as the tridiagonal matrix T(a): 

(1.12) T(a)(i,j) = l{i=j} + a-l{\i-j\ = l}, l<i,j<p. 
Note that in this case the UPS partitions the phase space optimally. 
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We now discuss the phase diagram of the lasso. The region {(i?,r):0 < 
*& < l,r > $} is partitioned into three regions as follows (see Figure 1): 

• Nonoptimal region: < ■& < 2o(l + a)" 1 and \{l + y/l - a 2 )$ < r < (1 + 
y^l^f ) 2 (1 — I n this region, the lasso is rate nonoptimal [i.e., the Ham- 
ming distance is L p • p c with constant c> I — ($ + r) 2 / (4r)], even when 
the tuning parameter is set ideally. 

• Optimal region: < i? < 1 and < r < ±(1 + y/1 - a 2 )<d and r < (1 + 
y/l — "d) 2 . In this region, if additionally a > 1/3, then the lasso may be 
rate optimal if the tuning parameter is set ideally. The discussion on the 
case < a < 1/3 is tedious so we skip it. 

• Region of exact recovery: < i? < 1 and r > (1 + \/l — i9) 2 and r > (1 + 

y^-j-rf ) 2 (1 — "&)■ In this region, if the tuning parameter is set ideally, the 
lasso may yield exact recovery with high probability. Region of exactly 
recovery by the lasso is substantially smaller than that of the UPS. There 
is a sub-region in the phase space where the UPS yields exact recovery, 
but the lasso could not even when the tuning parameter is set ideally. 

For discussions in the case where is the identity matrix, compare [15, 
25]. The above results are proved in Theorem 4.1, where we derive a lower 
bound for the Hamming errors by the lasso. In [17], we show that the lower 
bound is tight for properly large i9, but is not when i? is small. It is, however, 
tight for all $ 6 (0, 1) if we replace model (1.5) by a closely related model, 
namely (2.2) and (2.3) in [16]. For these reasons, the nonoptimal region of 
the lasso may be larger than that illustrated in Figure 1. The discussion on 
the exact optimal rate of convergence for the lasso is tedious and we skip it. 

Why is the lasso nonoptimal? To gain insight, we introduce the term of 
fake signal, a noise coordinate that may look like a signal due to correlation. 

Definition 1.3. We say that Yj is a signal if /3j / 0, is a fake signal if 
(fi/3)j / and j3j = 0, and is a (pure) noise if (3j = (fi/3)j = 0- 

With the tuning parameter set ideally, the lasso is able to distinguish 
signals from pure noise, but it does not filter out fake signals efficiently. In 
the optimal region of the lasso, the number of falsely kept fake signals is much 
smaller than the optimal rate, so it is negligible; in the nonoptimal region, the 
number becomes much larger than the optimal rate, and so is nonnegligible. 
This suggests that when X'X moves away from the tridiagonal case, the 
partitions of the regions by the lasso may change, but the nonoptimal region 
of the lasso continues to exist in rather general situations. 

The nonoptimality of the lasso is largely due to the fact that it is a one- 
stage method. An interesting question is whether UPS continues to work 
well if we replace the univariate thresholding by the lasso in the screening 
stage. The disadvantage of this proposal is that, compared to the univariate 
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thresholding, the lasso is both slower in computation and harder to analyze 
in theory. Still, one would hope the lasso could perform well in screening. 

With that being said, we note that the implementation of the lasso only 
needs minimal assumption on the model, which makes it very attractive, 
especially in complicated situations. In comparison, we need both signal 
sparsity and graph sparsity to implement the UPS, and how to extend it to 
more general settings remains unknown. The exploration along this line is 
continued in our forthcoming manuscripts [11, 19, 20]; see details therein. 

1.8. Nonoptimal region for the subset selection. The discussion on the 
subset selection is similar to that for the lasso so we keep it brief. Introduce 
vi(a) = ^ 2 g2^i a S) and ^2 («) = 2\/l — a? — 1. Similarly, the phase space 
partitions into three regions as follows: 

• Nonoptimal region: < ?9 < ?~7~vj^p and vi(a)i} < r < [ „ 2 ( a ) (v 7 ! ~ 2$ + 
^/l-2$ + $v 2 {a))} 2 . 

• Optimal region: < i? < 1 and i? < r < Ui(a)$ and r < (1 + y/l — i9) 2 . 

• Exact recovery region: < •& < 1, r > (l + Vl -#) 2 and r > [1(^/1 = 20+ 
^l-2$ + $v 2 {a))} 2 . 

See Theorem 4.2 for proofs and Figure 2 for illustration. Similar to the 
remarks in Section 1.7, the region of exact recovery and the optimal region 
of the subset selection may be smaller than those illustrated in Figure 2. 

The reason why the subset selection is nonoptimal is almost the opposite 
to that of the lasso: the lasso is nonoptimal for it is too loose on fake signals, 
but the subset selection is nonoptimal for it is too harsh on signal clusters 




Fig. 2. Left: a re-display of the left panel of Figure 1. Right: partition of the phase 
space by the subset selection in the tridiagonal model (1.11)— (1.12) (a = 0A). The subset 
selection is not rate optimal in the nonoptimal region. The exact recovery region by the 
subset selection is substantially smaller than that of the optimal procedure, displayed on 
the left. 
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(pairs/triplets, etc.). With the tuning parameter set ideally, the subset se- 
lection is effective in filtering out fake signals, but it also tends to kill one 
or more signals when the true signals appear in clusters. These falsely killed 
signals account for the nonoptimality. See Section 4.2 for details. 

1.9. Connection to recent literature. This work is related to recent liter- 
ature on oracle property [22, 30], but is different in important ways. A pro- 
cedure has the oracle property if it yields exact recovery. However, exact 
recovery is rarely seen in applications, especially when p> n. In many ap- 
plications (e.g., genomics), a large p usually means that signals are sparse or 
rare, and a small n usually means signals are weak. For rare and weak signals, 
exact recovery is usually impossible. Therefore, it is both scientifically more 
relevant and technically more challenging to compare error rates of different 
procedures than to investigate when they satisfy the oracle property. 

The work is also related to [5, 28] on asymptotic minimaxity, where the 
lasso was shown to be asymptotic rate optimal in the worst-case scenario. 
While their results seem to contradict with those in this paper, the dif- 
ference can be easily reconciled. In the minimax approach, the asymptotic 

least favorable distribution of (3 is given by (1 — s p )vq + e p u Tp , where 

e p =p~^, t p = y / 2TTog _ p and notably •& = r, which corresponds the bound- 
ary line of the region of no recovery in the phase space (e.g., [28], pages 18 
and 19, [1], Section 3). This suggests that the minimax approach has limita- 
tions: it reduces the analysis to the worst-case scenario, but the worst-case 
scenario may be outside the range of interest. In our approach, we let r) 
range freely, and evaluate a procedure based on how it partitions the phase 
space. Our approach has a similar spirit to that in [10]. 

The work is also related to the adaptive lasso [30]. The adaptive lasso 
is similar to the lasso, but the L 1 -penalty A lasso ||/3||i is replaced by the 
weighted L 1 -penalty Ylj=i w j I Pj 1 j where w = (wi , . . . , w p )' is the weight vec- 
tor. Philosophically, we can view the adaptive lasso as a screen and clean 
method. Still, the proposed approach is different from the adaptive lasso in 
important ways. First, Zou [30] suggested weight choices by the least squares 
estimate, which is only feasible when p is small. In fact, when p> n, our 
results suggest that feasible weights should be very sparse, while the weights 
suggested by the least squares estimates are usually dense. Second, for the 
surviving indices, we first partition them into many disjoint units of small 
sizes, and then fit them individually. The adaptive lasso fits all surviving 
variables together, which is computationally more expensive. Last, we use 
penalized MLE in the clean step while the adaptive lasso uses Z^-penalty. 
As pointed out before, the L 1 -penalty in the clean step is too loose on fake 
signals, which prohibits the procedure from being rate optimal. 

The work is also related to other multi-stage methods, for example, the 
threshold lasso [29] or the LOL [21]. These methods first use the lasso and 
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the OLS for variable selection, respectively, followed by an additional thresh- 
olding step. However, by an argument similar to that in Sections 1.7 and 1.8, 
it is not hard to see that these procedures do not partition the phase diagram 
optimally. 

1.10. Contents. In summary, we propose the UPS as a two-stage method 
for variable selection. We use Univariate thresholding in the screening step 
for its exceptional convenience in computation, and we use penalized MLE 
in the cleaning step because it is the only procedure we know so far that 
yields the optimal rate of convergence. On the other hand, the lasso and 
even the subset selection do not partition the phase space optimally. 

The remaining sections are organized as follows. Section 2 discusses the 
UPS procedure and the upper bound for the rate of convergence. The sec- 
tion also addresses how to estimate the tuning parameters of the UPS and 
the convergence rate of the resultant plug-in procedure. Section 3 discusses 
a refinement of the UPS for moderately large p. Section 4 discusses the be- 
havior of the lasso and the subset selection. Section 5 discusses numerical 
results where we compare the UPS with the lasso (the subset selection is 
computationally infeasible for large p so is not included for comparison). 
Due to limited space, we do not include proofs in this paper. The proofs can 
be found in the supplementary material for the paper [18]. 

Below is some notation we use in this paper. Fix < q < oo. For a p x 1 
vector x, \\x\\ q denotes the L 9 -norm of x, and we omit the subscript when 
q = 2. For a p x p matrix M, \\M\\ q denotes the matrix L 9 -norm, and ||M|| 
denotes the spectral norm. 

2. UPS and upper bound for the Hamming distance. In this section, 
we establish the upper bound for the Hamming distance and show that the 
UPS is rate optimal. We begin by discussing necessary notation. We then 
discuss the ?7-step and its sure screening and SAS properties. Next, we show 
how the regression problem reduces to many separate small-size regression 
problems and explain the rationale of using the penalized MLE in the P- 
step. We conclude the section by the rate optimality of the UPS, where the 
tuning parameters are either set ideally or estimated. 

Since different parts of our model are introduced separately in different 
subsections, we summarize them as follows. The model we consider is 






where 



(2.2) 





i.i.d. 




UPS DELIVERS OPTIMAL PHASE DIAGRAM 



13 



Fixing 8 > 0, i? > 0, and r > 0, we calibrate 

(2.3) e p =p~ , Tp = y / 2r!og~p, n p = p e , 
assuming that 

(2.4) 6> < (1 — t?). 

Recall that the optimal rate of convergence is L p p 1_ ^ +r ) 2 ^ 4T '\ In this sec- 
tion, we focus on the case where the exponent 1 — (1? + r) 2 /(4r) falls be- 
tween and (l — i?), or equivalently, 



(2.5) #<r < (l + Vl-0) ■ 

In the phase space, this corresponds to the region of almost full recovery. 
The case r < i? corresponds to the region of no recovery and is studied in 
Theorem 1.1. The case r > (1 + y/l — i9) 2 corresponds to the region of exact 
recovery. The discussion in this case is similar but is much easier, so we omit 
it. 

Next, fixing A > and 7 G (0, 1), introduce 



p 



Alp (7, A) = < £1 :p x p correlation matrix, |^(^j)| 7 < j 4 5 Vl<i<p>. 

I 3=1 J 

For any £1, let U = U (fi) be the pxp matrix satisfying U = < 
j}, and let d(Q.) = max{||C/(0)|| 1 , (([/(fi)^}. Fixing w G (0, 1/2), introduce 
A4*(w ,7,-4) = {fi G A4p(7,A):d(fi) < w }, and a subset of Al*(a;o,7, A), 

M+(lu , 7, A) = {n G 7W;(a; , 7, A) : n(i,j) > for all 1 < i,j < p}. 

For any Q G M*(uo, 7, A), the eigenvalues are contained in (1 — 2wo, l + 2wo), 
so £1 is positive definite (when loq > 1/2, O may not be positive definite). 
Last, introduce a constant •q = r]{'d,r,ujQ) by 

(2.6 ??=7T min^ — ,1 - -, Jl 1 - cup - 1 + - \. 

We suppose the support of signal distribution tt p is contained in 
(2-7) Ml + ^Tp], 



where r p = W2r log(p) as in (1.7). This assumption is only needed for proving 
the main lemma of the P-step (Lemma A. 5, [18]) and can be relaxed for 
proving other lemmas. Also, we assume the signals are one-sided mainly for 
simplicity. The results can be extended to the case with two-sided signals. 

We now discuss the [/-step. As mentioned before, the benefits of the U- 
step are threefold: dimension reduction, correlation complexity reduction, 
and computation cost reduction. The {/-step is able to achieve these goals 
simultaneously because it satisfies the sure screening property and the SAS 
property, which we now discuss separately. 
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2.1. The sure screening property of the V -step. Recall that in the U-step, 
we remove the jth variable if and only if |(a;,-,Y)| < t for some threshold 
t > 0. For simplicity, we make a slight change and remove the jth variable 
if and only if (xj,Y) < t. When the signals are one-sided, the change makes 
negligible difference. Fixing a constant q E (0, + r) 2 /(4r)), we set the 
threshold t in the U -step 

(2.8) t; = t;(q) = y/2qlog(p). 

Lemma 2.1 (Sure screening). In model (2.1)-(2.2), suppose (2.3)- (2.1) 
hold, and t* is as in (2.8). For sufficiently large p, if G M.p(ujQ,^,A), 
then asp^oo, £? =1 P(x$y < t* p ,Pj / 0) < L p pM^) 2 /(4r) _ The daim re _ 
mains true if alternatively fiW G A4*(u;o,7, -A), r/d < 3 + 2\/2- 

This says that the Hamming errors we make in the [/-step are not sub- 
stantially larger than the optimal rate of convergence, and thus negligible. 

2.2. The SAS property of the U -step. We need some terminology in 
graph theory (e.g., [7]). A graph G = (V,E) consists of two finite sets V 
and E, where V is the set of nodes, and E is the set of edges. A compo- 
nent Iq of V is a maximal connected subgraph, denoted by Iq < V . For any 
node v G V, there is a unique component Tq such that v G Zq < V. 

Fix a p x p symmetric matrix fio which is presumably sparse. If we let 
Vo = {1,2,... ,p} and say nodes i and j are linked if and only if Qo(i, j) ^ 0, 
then we have a graph G = (Vb,fio)- Fi x i > 0. Recall that W p (£) is the set of 
surviving indices in the [/-step 

(2.9) U p (t) = U p (t, Y, X) = {j : (xj,Y) >t,l<j<p}. 

Note that the induced graph (U p (t),Qo) splits into many components. 

Definition 2.1. Fix an integer K >l. We say that U p {t) has the sep- 
arable after screening (SAS) property with respect to (Vq,Qq,K) if each 
component of the graph (U p (t) ,^o) has no more than K nodes. 

Note that if U p (t) has the SAS property with respect to (Vq,Qq,K). Then 
for all s>t, U p (s) also has the SAS property with respect to (Vq,Qo, K). 

Return to model (2.1)— (2.2). We hope to relate the regression setting to 
a graph (Vo, Oo), and use it to spell out the SAS property. Toward this end, 
we set Vo = {1, 2, . . . ,p}. As for 0,q, a natural choice is the matrix f2 in (2.2). 
However, the SAS property makes more sense if fio is sparse and known, 
while O is neither. In light of this, we take to be a regularized empirical 
covariance matrix. 

In detail, let f2 = X'X be the empirical covariance matrix. Recall that X = 
(X\,X2, ■ ■ ■ , X n )' and Xj ~ A r (0, ^fi). It is known [4] that there is a constant 
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C > such that with probability 1 — o(l/p 2 ), for all 1 <i,j <p, 

(2.io) \ti(i,j) - < cyiog(p)/v^- 

For large p, fl is a noisy estimate for fi, so we regularize it by 
( 2 - n ) * (*<•?') = ^ (^^' ) !{ | n ) | > log" 1 (p) } " 

The threshold log" 1 ^) is chosen mainly for simplicity and can be replaced 
by log _a (p), where a > is a constant. The following lemma is a direct result 
of (2.10); we omit the proof. 

Lemma 2.2. Fix A > 0, 7 G (0, 1) and u G (0, 1/2). As p — > 00, for any 
Q G A4*(ujo, 7, A), with probability of 1 — o(l/p 2 ), each row of £1* has no 

more than 21og(p) nonzero coordinates, and ||f2* — ^||oo < C(log(p))~( 1 ~" / ^ . 

Taking Qq = £1* , we form a graph (Vo, £1*). The following lemma is proved 
in [18], which says that, except for a negligible probability, U p {t* p ) has the 
SAS property. 

Lemma 2.3 (SAS). Consider model (2.1)-(2.2) where (2.3)-(2.7) hold. 
Set t* as (2.8). As p—> 00, there is a constant K such that with probability 
1 - Lpp-^+ r ) 2 /(4r) ; has the SAS propert y wit h respect to (V 0> n*,K). 

2.3. Reduction to many small-size regression problems. Together, the 
sure screening property and the SAS property make sure that the original 
regression problem reduces to many separate small-size regression problems. 
In detail, the SAS property implies that U p {t p ) splits into many connected 
subgraphs, each is small in size, and different ones are disconnected. Given 
two disjoint connected subgraphs Iq and Jq where Iq <lA p {t) and Jo<\U p (t), 

(2.12) n*(i,j) = yi £ l ,j £ J . 

Recall that the regression model (1.1) is closely related to the model X'Y = 
X'Xj3 + X'z. Fixing a connected subgraph Iq <\U p (t*), we restrict our at- 
tention to Iq by considering (X'Y) X ° = (X'X/3) X ° + (X'z) x °. See Defini- 
tion 1.1 for notation. Since Xi ' A(0, -Q) and Iq has a small size, we 
expect to see (X'Xf3) x ° ss (fi/3) z ° and (X'z) x ° w A(0, Q x °> x °). Therefore, 
(X'Y) X ° » N{(nf3) Xo ,n Xo ' Xo ). A key observation is 

(2.13) (nf3) Xo ^n Xo ' Xo f3 Xo . 

In fact, letting Iq = {j : 1 < j < p,j ^ Iq}, it is seen that 

(2.14) {np) x ° - n x °' x °p Xo = (n*f°' x op x o + (n - n*) 10 ^^ = i + n. 

First, by Lemma 2.2, |II| < C7||0 — fi*||oo||^||oo = o(y/log(p)) coordinate-wise, 
hence II is negligible. Second, by the sure screening property, signals that are 
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falsely screened out in the [/-step are fewer than L p p 1_ ( l3+ ' r ) 2 /( 4r '), and there- 
fore have a negligible effect. To bring out the intuition, we assume U p {t* p ) 
contains all signals for a moment (see [18], Lemma A. 4, for formal treat- 
ment). This, with (2.12), implies that 1 = 0, and (2.13) follows. 

As a result, the original regression problem reduces to many small-size 
regression problems of the form 

(2.15) (X'Y) X ° « N(Q Xa ' Xo f3 Xo ,n Xo > x °) 

that can be fitted separately. Note that Q, X °' X ° can be accurately estimated 
by (X' X) X °' X ° , due to the small size of Iq. We are now ready for the P-step. 

2.4. P-step. The goal of the P-step is that, for each fixed connected sub- 
graph Iq <lU p (t*), we fit model (2.15) with an error rate < L p p~^ Jrr ' /( 4r ). 
This turns out to be rather delicate, and many methods (including the lasso 
and the subset selection) do not achieve the desired rate of convergence. 

For this reason, we proposed a penalized-MLE approach. The idea can be 
explained as follows. Given that Iq <\U p (t* p ) as a priori, the chance that Iq 

contains k signals is ~£p. This motivates us to fit model (2.15) by maximiz- 
ing the likelihood function e k p ■ exp[-±[(X'Y) x ° - A^i}' A- l [{X'Y) x ° - A/j]], 
subject to \\fi\\o = k. Recalling A = {X 1 'X) Xo,x ° ~ fi 2 ' 0,2 ' , this is proportional 
to the density of (X'Y) X ° in (2.15), hence the name of penalized MLE. Re- 
calling e p =p and Ap PS = \/2i9 logp, it is equivalent to minimizing 

(2.16) [(X'Y) X ° - AtfA-^X'Y)* - Afi) + (A£ ps ) 2 • \\fi\\ . 

Unfortunately, (2.16) does not achieve the desired rate of convergence as 
expected. The reason is that we have not taken full advantage of the infor- 
mation provided: given that all coordinates in Iq survive the screening, each 
signal in Iq should be relatively strong. Motivated by this, for some tuning 
parameter u ups > 0, we force all nonzero coordinates of f/, to equal u ups . This 
is the UPS procedure we introduced in Section 1. In Theorem 2.1 below, we 
show that this procedure obtains the desired rate of convergence provided 
that ii ups is properly set. 

One may think that forcing all nonzero coordinates of \x to be equal is too 
restrictive, since the nonzero coordinates of f3 x ° are unequal. Nevertheless, 
the UPS achieves the desired error rate. The reason is that, knowing the 
exact values of the nonzero coordinates is not crucial, as the main goal is to 
separate nonzero coordinates of (3 X ° from the zero ones. 

Similarly, since knowing the signal distribution n p may be very helpful, 
one may choose to estimate ir p using the data first and then combine the esti- 
mated distribution with the P-step. However, this has two drawbacks. First, 
model (2.15) is very small in size, and can be easily over fit if we introduce 
too many degrees of freedom. Second, estimating n p usually involves decon- 
volution, which generally has relatively slow rate of convergence (e.g., [26]); 
a noisy estimate of tt p may hurt rather than help in fitting model (2.15). 
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2.5. Upper bound. We are now ready for the upper bound. To recap, the 
proposed procedure is as follows: 

• With fixed tuning parameters (t, A ups , u ups ), obtain U p (t) = {j : 1 < j < p, 
( Xj ,Y)>t}. 

• Obtain SI* as in (2.11), and form a graph (Vo,^o) with Vq = {1, 2, . . . ,p} 
and f2 = ^* • 

• Split U p (t) into connected subgraphs where different ones are discon- 
nected. For each connected subgraph 1$ = . . . , ir}, obtain the min- 
imizer of (2.16), where each coordinate of \i is either or u ups . Denote 
the estimate by /i(X ) = fi(l ; Y, X, t, A ups , u ups ,p). 

• For any 1 < j < p, if j ^ U p (t), set (3j = 0. Otherwise, there is a unique 
Iq = {ii,«2) • • • ; ii<} < Mp(t), where ii < ?2 < • ■ • < ik, such that j is the 
fcth coordinate of Xq. Set /9j = (/i(Xo))fc- 

Denote the resulting estimator by (3(Y, X; t, A ups , u nps ). We have the follow- 
ing theorem. 

Theorem 2.1. Consider model (2.1)-(2.2) where (2.3)-(2.7) hold, and 
fix0<q< (■!? + r) 2 /(4r). For sufficiently largep, ifQ^ € M p (ujq, 7, A), and 
we set the tuning parameters of the UPS at 

t = t; = y/2qlog(p), A ups = A^ ps = V^Io^, u ups =^ ps = r p , 

then as p -»• 00, Hamm p (/3 ups (Y, X; t£, A p ps , u p ps ), 0, r, n&0) < L p • s p ■ 
p -(r-&) 2 /(4r)_ The daim remains va i i( i if r /$ < 3 + 2y /2 and e -M P (wo, 
7, A) for sufficiently large p. 

Except for the L p term, the upper bound matches the lower bound in 
Theorem 1.1. Therefore, both bounds are tight and the UPS is rate optimal. 

2.6. Tuning parameters of the UPS. The UPS uses three tuning param- 
eters (t* A p ps , Up ps ). In this section, we show that under certain conditions, 
the parameters (A p ps ,u p ps ) can be estimated from the data. 

In detail, recall that Y = X'Y. For t > 0, introduce F p (t) = ± £)? =1 H Y j > 
t] and fJ, p (i) = | X)j=i ' ^-{Xj > Denote the largest off-diagonal coordi- 
nate of ft by 5o = 5q(£1) = maxr 1< jj <p j_^ J }|r2(i, Recalling that the sup- 
port of 7T p is contained in [r p , (1 + r/)r p ], we suppose 

($ _|_ r )2 

(2.17) 2<5 (l+7?) -l<0/r so that ^(l + 7?) 2 r < v - ' . 

4r 

Let /i p (vr p ) be the mean of 7r p . The following is proved in [18]. 
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Lemma 2.4. Fix q such that max{<5Q(l + r]) 2 r, $} < q < (# + r) 2 /(4r), 
an<f lei i* = \/2qlogp. Suppose the conditions in Theorem 2.1 hold. As 
p — > co, with probability of 1 — o(l/p), 

(2.18) \[F p (t;)/e p ]-l\ = o(l) and / {e p ^ p ))} - l\ = o{\). 

Motivated by Lemma 2.18, we propose to estimate (A ups ,tt ups ) by 
\^ = \^(q) = J-2\og(F p (t* p )), 

(2.19) 

u^ s = u^(q)=^ p (t;)/F p (t;). 

Theorem 2.2. Fix q such that max{^(l + ?7) 2 r,'!9} <q< (1? + r) 2 / (4r) , 
and let t* = ^/2qTogp. Suppose the conditions of Theorem 2.1 hold. As 

p — > oo, if additionally fi*(7r p ) < (1 + o(l))r p , then Hamm p (/3 ups ) < L p ■ s p ■ 
p -(r-*)3/(4r) > 

As a result, i* is the only tuning parameter needed by the UPS. By 
Theorem 2.2, the performance of the UPS is relatively insensitive to the 
choice of t*, as long as it falls in a certain range. Numerical studies in 
Section 5 confirm this for finite p. The numerical study also suggests that 
the lasso is comparably more sensitive to its tuning parameter A lasso . 

2.7. Discussions. While the conditions in Theorems 2.1 are 2.2 are rel- 
atively strong, the key idea of the paper applies to much broader settings. 
The success of UPS attributes to the interaction of the signal sparsity and 
graph sparsity, which can be found in many applications [e.g., compressive 
sensing, genome-wide association study (GWAS)]. 

In the forthcoming papers [11, 19, 20], we revisit the key idea of this paper, 
and extend our results to more general settings. However, the current paper 
is different from [11, 19, 20] in important ways. First, the focus of [11] is on 
ill-posed regression models and change-point problems, and the focus of [20] 
is on Ising model and network data. Second, the current paper uses the so- 
called "phase diagram" as a new criterion for optimality (e.g., [10]), and Jin 
and Zhang [19] use the more traditional "asymptotic minimaxity" as the 
criterion for optimality. Due to the complexity of the problem, one type of 
optimality usually does not imply the other. The current paper and [19] have 
very different targets, objectives and underlying mathematical techniques, 
and the results in either one cannot be deduced from the other. 

The current paper is new in at least two aspects. First, given that marginal 
regression is a widely used method but is not well justified, this paper shows 
that marginal regression can actually work, provided that an additional 
cleaning stage is performed. Second, it shows that L°-penalization method — 
the target of many relaxation methods — is nonoptimal, even in very simple 
settings, and even when the tuning parameter is ideally set. 
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3. A refinement for moderately large p. We introduce a refinement for 
the UPS when p is moderately large. We begin by investigating the relation- 
ship between the regression model and Stein's normal means model. 

Recall that model (1.1) is closely related to the following model: 

(3.1) X'Y = X'X/3 + X'z, 3~2V(0,I„), 

which is approximately equivalent to Stein's normal means model as follows: 

(3.2) X'Y ss tt(3 + N(0, tt) tt^X'Y^ NiP,^- 1 ). 

In the literature, Stein's normal means model has been extensively studied, 
but the focus has been on the case where O is diagonal (e.g., [26]). When Q 
is not diagonal, Stein's normal means model is intrinsically a regression 
problem. To see how close models (3.1) and (3.2) are, write 



(3.3) X'Y- 



n 



tt/3 + i-rX'z 



+ 



(x'x-n)p+ 



1 )^X'z 

n J \\z\\ 



= 1+11. 



First, note that I ~ N(flf3, fi). For II, we have the following lemma. 



Lemma 3.1. Consider model (2.1)-(2.2) where (2.2)- (24) hold. As 
p — > oo, there is a constant C > such that except for a probability of o(l/p), 



\z\ 



1 



n 



< C (y/\og p)p 



-0/2 



\\{x'x - < c\\n\\( y ^2i^)p~^ 1 -^/ 2 . 



It follows that | II | < (7^2 log (p) .p-^-C 1 "^)]/ 2 coordinat e-wise. Therefore, 
asymptotically, models (3.1) and (3.2) have negligible difference. However, 
when p is moderately large, the difference between models (3.1) and (3.2) 
may be nonnegligible. In Table 1, we tabulate the values of a/2 log(p) • 
p-[ e -( l -' a )]/ 2 ; which are relatively large for moderately large p. 

This says that, for moderately large p, the random design model is much 
noisier than Stein's normal means model. As a result, in the [/-step, we tend 
to falsely keep more noise terms in the former than in the latter; some of 
these noise terms are large in magnitude, and it is hard to clean all of them 



Table 1 





The values of y^2log(p)p 


9_(l-0)]/2 


for different p 


and (6,0) 




p 


400 5 X 400 


5 2 X 400 


5 3 X 400 


5 4 X 400 


5 5 X 400 


(9,0) 


= (0.91,0.65) 0.65 0.46 


0.33 


0.22 


0.15 


0.10 


(M) 


= (0.91,0.5) 1.01 0.82 


0.65 


0.51 


0.39 


0.30 
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in the P-step. To see how the problem can be fixed, we write 

(3.4) x'xp = (x'x -n*)p + n*p. 

On one hand, the term (X'X — f2*)/3 causes the random design model to 
be much noisier than Stein's normal means model. On the other hand, this 
term can be easily removed from the model if we have a reasonably good 
estimate of (3. This motivates a refinement as follows. 

For any pX 1 vector y, let S 2 (y) = Y^=l{Vj ~ vf where y = | Y7j=x Vj- 
We propose the following procedure: (1) Run the UPS and obtain an esti- 
mate of p, say, p. Let = X'Y and p(°) = 0. (2) For j = 1,2,3, respec- 
tively, let W® = X'Y - (X'X - O*)/^'- 1 ). If S(W^)/S(W(i-V) < 1.05, 
run the UPS with X'Y replaced by and other parts unchanged, and 

let be the new estimate. Stop otherwise. 

Numerical studies in Section 5 suggest that the refinement is benefi- 
cial for moderately large p. When p is sufficiently large [e.g., y21og(p) • 
p~\ d ~( 1 -'®)]/' 2 < 0.4], the original UPS is usually good enough. In this case, 
refinements are not necessary, but may still offer improvements. 

4. Understanding the lasso and the subset selection. In this section, 
we show that there is a region in the phase space where the lasso is rate 
nonoptimal (similarly for subset selection). We use Stein's normal means 
model instead of the random design model (as the goal is to understand 
the nonoptimality of these methods, focusing on a simpler model enjoys 
mathematical convenience, yet is also sufficient; see Section 1.7). 

To recap, the model we consider in this section is Y ~ N(Q,p, f2), where Y 
is the counterpart of X'Y in the random design model. Fix a 6 (—1/2, 1/2). 
As in Section 1.7, we let O be the tridiagonal matrix as in (1.12), and tt p be 
the point mass at T p = \J2r log p. In other words, 

(4.1) /3/~ d ' (l-e p )v + e p v Tp , e p =p~®, t p = y / 2rlogp. 

Throughout this section, we assume r > $ so that successful variable selec- 
tion is possible. Somewhat surprisingly, even in this simple case and even 
when (e p ,r p ) are known, there is a region in the phase space where neither 
the lasso nor the subset selection is optimal. To shed light, we first take 
a heuristic approach below. Formal statements are given later. 

4.1. Understanding the lasso. The vector Y consists of three main com- 
ponents: true signals, fake signals and pure noise (see Definition 1.3). Ac- 
cording to (4.1), true signals may appear as singletons, pairs, triplets, etc., 
but singletons are the most common and therefore have the major effect. 
For each signal singleton, since Q is tridiagonal, we have two fake signals, 
one to the left and one to the right. Given a site j, 1 < j < p, the lasso may 
make three types of errors: 
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• Type I. ij is a pure noise, but the lasso mistakes it as a signal. 

• Type II. Yj is a signal singleton, but the lasso mistakes it as a noise. 

• Type III. Yj is a fake signal next to a signal singleton, but the lasso 
mistakes it as a signal. 

There are other types of errors, but these are the major ones. 

To minimize the sum of these errors, the lasso needs to choose the tuning 
parameter A lasso carefully. To shed light, we first consider the uncorrelated 
case where Q is the identity matrix. In this case, we do not have fake signals 
and it is understood that the lasso is equivalent to the soft-thresholding 
procedure [26], where the expected sum of types I and II errors is 

(4.2) p[(l - e p )$(A lasso ) + £p $(A lasso - t p )}. 

Here, $ = 1 — $ is the survival function of iV(0, 1). In (4.2), fixing < q < 1 
and taking A lasso = A p asso = y/2q\og(p), the expected sum of errors is 

^ f L p [p l ~i + p 1 -(< ? +(v / 9-v^) 2 )] 5 if o < q < r> 
lp 1 ~ q +p 1 ~^, iiq>r. 

The right-hand side is minimized at q = ($ + r) 2 /(4r) at which A p asso = 
^f-Tp, and the sum of errors is L p p 1_ ^ +r ^ /( 4r ), which is the optimal rate 
of convergence. For a smaller q, the lasso keeps too many noise terms. For 
a larger q, the lasso kills too many signals. 

Return to the correlated case. The vector Y is at least as noisy as that 
in the uncorrelated case. As a result, to control the type I errors, we should 
choose A p asso to be at least 2r~ r P- This is confirmed in Lemma 4.2 below. 

In light of this, we fix q > (t? + r) 2 /(4r) and let A p asso = y / 2qlog(p) from 
now on. We observe that except for a negligible probability, the support 
of /3 lasso , denoted by 5 p asso , splits into many small clusters (i.e., block of 
adjacent indices). There is an integer K not depending on p that has the 
following effects: (a) If Yj is a pure noise, and there is no signal within 
a distance of K from it, then either /3} asso = 0, or /3] asso / but /3^f = 0, and 
(b) If Yj is a signal singleton, and there is no other signal within a distance 
of K from it, then either /3j asso = 0, or /3j asso ^ but /3j±2 = and at least 
one of {/3j+f°, /^f } is 0. These heuristics are justified in [17] (we use such 
heuristics to provide insight, but not for proving results below). 

At the same time, let Zo = {j — k + 1, . . . , j} C 5 p asso be a cluster, so that 
^lasso = ^lasso = Q gince n ig tr idiagonal, (/S 1 ^ 80 )^, the restriction of /3 lasso 
to Zo, is the solution of the following small-size minimization problem: 

(4.3) lfi'{Q Xo ' Xo )fi - n'Y Xo + A lasso ||^||i where ^ is a k x 1 vector. 
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See Definition 1.1. Two special cases are noteworthy. First, Iq = {j}, and 
the solution of (4.3) is given by /3] asso = sgn(Y/)(|Yjl - A lasso ) + , which is the 
soft-thresholding [26]. Second, Xq = {j — 1, j}. We call the solution of (4.3) in 
this case the bivariate lasso. We have the following lemma, where all regions 
I-IIId are illustrated in Figure 3 (x-axis is Yj-i, y-axis is Yj). 

Lemma 4.1. Denote A = A lasso . The solution of the bivariate lasso 0^f, 
/3] asso ) is given by 0] a I\°Jf sso ) = (sgn(^-_i)(|^_i| - A)+,sgn(^)(|^| - 
A)+) if {Yj-i,Yj) is in regions I, Ila-IId and 0}™f, /3] asso ) = ^(Zj^ - 
aZj,Zj — aZj_i) if (Yj_i,Yj) is in regions Illa-IIId. Here, Zj_i =Yj—\ — A if 
(Yj-i,Yj) is in regions Ilia, Hid and Zj_\ = lj— i + A otherwise; Zj =Yj — A 
if (Yj—i,Yj) is in regions Ilia, Illb and Zj =Yj + A otherwise. 

In the white region of Figure 3, both /^j 3 ^ and /jj asso are 0. In the blue 
regions, exactly one of them is 0. In the yellow regions, both are nonzero. 
Lemma 4.1 is proved in [18]. 

As a result, the following hold, except for a negligible probability: 

• Type I. There are 0(p) indices j where Yj is a pure noise, and no signal 
appears within a distance of K from it. For each of such the lasso 
acts on Yj as (univariate) soft-thresholding, and p\ asso ^ if and only if 

ftl >\p SSO - 

• Types II— III. There are 0(pe p ) indices where Yj is a signal singleton, and 
no other signal appears within a distance of K from it. The lasso either 
acts on Yj as soft-thresholding, or acts on both Yj and one of its neighbors 

as the bivariate lasso. As a result, /3j asso = if and only if \Yj\ < Ajf sso 

(type II), and both /3j asso and p l J^f> are nonzero if and only if (Y,_i, Y^)' 
falls in regions Illa-IIId, with Ilia and Illb being the most likely (type III). 
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Noting that Yj ~ N(0, 1) if it is a pure noise and Yj ~ N(t p , 1) if it is a sig- 
nal singleton, the sum of types I and II errors is L p p[P(N(0, 1) > Ajf sso ) + 
£ p P(N(t p , 1) < Ajf sso )] = L pP m\ p asso ) + e p $(Ajf sso - t p )}. Also, when Yj is 

a signal singleton, (Yj—i^Yj)' is distributed as a bivariate normal with 
means ar p and r p , variances 1, and correlation a. Denote such a bivariate 
normal distribution by W for short. The type III error is L p p ■ P{f3j^\ = 0, 
(3j =t p , (Yj-x,Yj)' G regions Ilia or Illb) ~ L p pe p ■ P(W G regions Ilia 
or Illb). Therefore, the sum of three types of errors is 

(4.4) L p p ■ [i>(Aj? sso ) + e p $(Aj, asso - t p ) + e p P(W G regions Ilia or Illb)] , 

which can be conveniently evaluated. Note that the sum of types I and II 
errors in the correlated case is the same as that in the uncorrelated case, 
which is minimized at A^ = (i? + r)/(2r)r p . Therefore, whether the lasso 
is optimal or not depends on whether the type III error is smaller than the 
optimal rate of convergence or not. Unfortunately, in certain regions of the 
phase space, the type III error can be significantly larger than the optimal 
rate. In other words, provided that the tuning parameters are properly set, 
the lasso is able to separate the signal singletons from the pure noise. How- 
ever, it may not be efficient in filtering out the fake signals, which is the 
culprit for its nonoptimality. 

For short, write Hamm p (/3 lasso (Aj? sso )) = Hamm(/3 lasso (Ajf so ); e p , r p , a). The 
following is proved in [18], confirming the above heuristics. 

Lemma 4.2. Fix $ G (0,1), r > q>0 and a G (-1/2, 1/2). Set the 
lasso tuning parameter as X l p SSO = \/2qlogp. As p—> oo, 

Hamm(/3 lasso (Ajf sso )) 

L -ndn{((l-|a|)/(l+|o|)) M -d} if < q < ^±lL > 

4r 

( L p -min{((l-|a|)/(l+|a|)) g ,(^f-V9) 2 } ) + r f <q<r> 

p 4r ' 

1(1 + 0(1)), ifq>r. 

The exponent on the right-hand side is minimized at q = ($ + r) 2 / (4r) 
when r < [(1 + y/ l - a 2 )/\a\]d and q = (1 + |o|)(l - y /1 - a 2 )r/(2a 2 ) when 
r > [(1 + Vl — fl 2 )/l a |]^5 where we note that r < [(1 + Vl — o 2 )/|a|]7? and r > 
[(1 + y/l — a 2 ) /| a|]?9 correspond to the optimal and nonoptimal regions of the 
lasso, respectively. This shows that in the optimal region of the lasso, Ajf sso = 
(i? + r)/(2r)r p remains the optimal tuning parameter, at which the sum of 
types I and II errors is minimized, and the type III error has a negligible 



24 



P. JI AND J. JIN 



effect. In the nonoptimal region of the lasso, at Ajf so = (0 + r)/(2r)r p , the 
type III error is larger than the sum of types I and II errors, so the lasso 
needs to raise the tuning parameter slightly to minimize the sum of all three 
types of errors (but the resultant Hamming error is still larger than that of 
the optimal procedure). Combining this with Lemma 4.2 gives the following 
theorem, the proof of which is omitted. 

Theorem 4.1. Set Ap asso = y/2q\ogp. For all choices of q>0, the error 
rate of the lasso satisfies Hamm p (/3 lasso (Ap asso )) > L p ■ s p ■ p~( ,?_r ) 2 /( 4r ) when 
r/ti< (l + Vl-a 2 )/|a| and 

Hamm p ( ( a Iasso (A^ asso )) > L p • s p • /-((i-MHi-^i^)/^ 2 ))^ 
when r/'d > (1 + Vl — a?) /\a\ . 

In [17], we show that when r/'d < 3 + 2i/2, the lower bound in Theorem 4.1 
is tight. The proofs are relatively long, so we leave the details to [17]. 

4.2. Understanding subset selection. The discussion is similar, so we keep 
it brief. Fix 1 < j < p. The major errors that subset selection makes are the 
following (type III is defined differently from that in the preceding section): 

• Type I. Yj is a pure noise, but subset selection takes it as a signal. 

• Type II. Yj is a signal singleton, but subset selection takes it as a noise. 

• Type III. (Yj-i,Yj) is a signal pair, but subset selection mistakes one of 
them as a noise. 

Suppose that Yj is either a pure noise or a signal singleton, and for an 
appropriately large K, no other signal appears within a distance of K from 
it. In this case, except for a negligible probability, /3]±i° = 0, and the subset 

selection acts on site j as hard thresholding [26], /3J 5 = Yj ■ 1{|Yj-| > A ss }. 

Recall that Yj ~ A^(0, 1) if it is a pure noise, and Yj ~ N(t p , 1) if it is a signal 
singleton. Take A ss = A^ s = \/2qlogp as before. Similarly, the expected sum 
of types I and II errors is 

L pP mx s ;)+ P -M>^-T P )} 

(4.5) 

~\L p {p l ^+p 1 ^), if q>r. 

On the right-hand side, the exponent is minimized at q = {j} + r) 2 /4r, at 
which the rate is L p p 1_ ^ +r ) /( 4r ), which is the optimal rate of convergence. 

Next, consider the type III error. Suppose (Yj-i,Yj) is a signal pair and 
no other signal appears within a distance of K for a properly large K. 
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Similarly, since fi is tridiagonal, (Pj-i, (3j S )' is the minimizer of the functional 

|/3f_a + \0j + aPj-tfj - {Yj-ifa-x + YjPj) + ^ / 0} + /{/% / 0}) . 
We call the resultant procedure bivariate subset selection. The following 
lemma is proved in [18], with the regions illustrated in Figure 3. 

Lemma 4.3. The solution of the bivariate subset selection is given by 
0f_ 1 Jf) = (O,O) if is in region I, 0f_ 1 Jf) = (Y j _ 1 ,O) if (Y^Yj) 

is in regions Ha, lie, ((3j S _i, (3j s ) = (0,Yj) if (Yj—i,Yj) is in regions lib, i.i.d. 
and (/3f_ lf /3f ) = C^Sf^ ^T^T 1 ) if {Yj-^i) is m re 9 ions Ula-IIId. 

When (Yj_i,Yj) falls in regions I, Ha or lib, either /SfLj or /3j s is 0, and 
the subset selection makes a type III error. Note there are 0(pe p ) signal 

pairs, and that (Yj—i,Yj)' is jointly distributed as a bivariate normal with 
means (1 + a)r p , variances 1 and correlation a. The type III error is then 

L p pi-(2^+mm{[(VKi^-v^) + ]^2[( v / K^)-V9) + ] 2 }. Combining with (4.5) and 
Mills's ratio gives the sum of all three types of errors. Formally, writing for 
short Hamm p (/3 SS (A„ S )) = Hamm p (/3 SS (A„ S ); e p , t p , a), we have the following 
lemma proved in [18]. 

Lemma 4.4. Set the tuning parameter X p s = \/2qlogp. The Hamming 
error for the subset selection Hamm p (/3 ss (A p s )) is at least 

L„ ■ s .^mi„ {g ^^ + [(VKT^)-v^) + ] 2 } ) if0 < q < (l+lf > 

L P -S P - p - m in{(v^-^) 2 ^+[(v / ^ Z ^)-v^) + ] 2 } 5 if (l+ll. <q<r> 

, s p • (1 + o(l)), ifq>r. 

The exponents on the right-hand side are minimized at q = (i? + r) 2 / (4r) 
if r/0 < [2 - v / T^ 2 "]/[y / l ^?(l - V T^ 2 ) ], and at g = [2i? + r(l - a 2 )] 2 / 
[4r(l - a 2 )] if r/0 > [2 - y/l - a 2 ]/[Vl - a 2 (l - Vl - a 2 )]. As a result, we 
have the following theorem, the proof of which is omitted. 



Theorem 4.2. Set the tuning parameter Ap s = \/2q\ogp. Then for all 
q > 0, the Hamming error of the subset selection satisfies 

Hamm p (/3 ss (A p s )) 



9 _ y/l^ 



> < 



v / T^ 2 "(i-v / T^ 2 ") ; 



Lpp -[2^+r(l-a 2 )] 2 /(4r(l-a 2 ))+^ ) tf - > 



2 



V^^Cl-v 7 !^ 2 ") 
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This gives the phase diagram in Figure 2, where (#, r) satisfying r/i? < [2 — 
\/l — a 2 ]/[Vl — a 2 (l — Vl — a 2 )] defines the optimal region, and ($,r) with 
r/?9 > [2 — \/l — fl2 ]/[Vl — a 2 (l — vT — a 2 )] defines the nonoptimal region. 
Similar to the lasso, the subset selection is able to separate signal singletons 
from the pure noise provided that the tuning parameter is properly set. But 
the subset selection is too harsh on signal pairs, triplets, etc., which costs its 
rate optimality. In [17], we further show that in certain regions of the phase 
space, the lower bound in Theorem 4.1 is tight. 

5. Simulations. We have conducted a small-scale empirical study of the 
performance of the UPS. The idea is to select a few interesting combina- 
tions of ($,#,7T p ,Q) and study the behavior of the UPS for finite p. Fixing 
(p, Tip, Q, 6), let rip = p e and e p =p~' & . We investigate both the random 
design model and Stein's normal means model. 

In the former, the experiment contains the following steps: (1) Generate 

a p x 1 vector /3 by f3j (1 — e p )vq + e p TT p , and an n p x 1 vector z ~ 
N(0,I n ). (2) Generate &n n p x p matrix X the rows of which are samples 
from iV(0 iQ); let Y = X(5 + z. (3) Apply the UPS and the lasso. For 

the lasso, we use the glmnet package by Friedman et al. [14] (Jl is assumed 
unknown in both procedures). (4) Repeat 1-3 for 100 independent cycles, 
and calculate the average Hamming distances. 

In the latter, the settings are similar, except for (i) n p = p, (ii) Y ~ 
iV(ri 1//2 /3, Ip) in step 2 and (iii) f2 is assumed as known in step 3 (other- 
wise valid inference is impossible). We include Stein's normal means model 
in the study for it is the idealized version of the random design model. 

Experiment 1. In this experiment, we use Stein's normal means model 
to investigate the boundaries of the region of exact recovery by the UPS 
and that by the lasso. Fixing p = 10 4 and ft as the tridiagonal matrix 
in (1.12) with a = 0.45, we let i? range in {0.25,0.5,0.65}, and let tt p = v T 
with t p = \J2r logp, where r is chosen such that t p € {5, 6, . . . , 12}. For both 
procedures, we use the ideal threshold introduced in Sections 2 and 4, respec- 
tively. That is, the tuning parameters of the UPS are set as (t*, Xp ps ,Up Pb ) = 

i^2r~ T pi V^^logG 9 )' r p)> an d the tuning parameter of the lasso is set as 

Ajr° = max{^, (i + va-^/a+a))- 1 }^. 

The results are reported in Table 2, where the UPS outperforms consis- 
tently over the lasso, most prominently in the case of $ = 0.25. Also, for 
i? = 0.25,0.5, or 0.65, the Hamming errors of the UPS start to fall below 1 
when T p exceeds 8, 7 or 7, respectively, but that of the lasso won't fall below 1 
until T p exceeds 12,8 or 7, respectively. In Section 1, we show that the UPS 
yields exact recovery when r p > (1 + y/l — "d)y/2\ogp, where the right-hand 
side equals (8.01, 7.32, 7.01) with the current choices of (p, The numerical 
results fit well with the theoretic results. 
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Table 2 

Hamming errors (Experiment 1). UPS needs weaker signals for exact recovery 
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Experiment 2. We use a random design model where (p, 9) = (10 4 , 
0.65, 0.91), and t p £ {1,2, . . . ,7}. The experiment contains three parts, 2a- 
2c. In 2a, we take f2 to be the penta-diagonal matrix = l{i = j} + 

0.4 • — j\ = 1} + 0.1 • l{\i — j\ = 2}. Also, for each r p , we set ir p as 
Uniform(r p — 0.5, t p + 0.5). In 2b, we generate O in a way such that it 
has 4 nonzero off-diagonal elements on average in each row and each col- 
umn, at locations randomly chosen. Also, for each r p , we take n p to be 
Uniform(r p — 1,t p + 1). In 2c, we use a non-Gaussian design for X. In de- 
tail, first, we generate annxp matrix M the coordinates of which are i.i.d. 
samples from Uniform(— s/S, v3)- Second, we generate O as in 2b. Last, we 
let X = (l/- v /n)Mf2 1//2 . Also, for each t p , we take tt p to be the mixture of 
two uniform distributions ^Uniform(r p — 0.5, T p + 0.5) + \ Uniform(— r p — 
0.5, — T p + 0.5). In all these experiments, the tuning parameters are set the 
same way as in Experiment 1. The results are reported in Table 3, suggesting 
that the UPS outperforms the lasso almost over the whole range of t p . 

Experiment 3. The goal of this experiment is twofold. First, we inves- 
tigate the sensitivity of the UPS and the lasso with respect to their tuning 
parameters. Second, we investigate the refined UPS introduced in Section 3. 
Fix q > 0. For the lasso, we take Ap asso = y / 2(?log(p). For the UPS, set the 

U-step tuning parameter as t* = y / 2qlog(p) and let the P-step tuning pa- 
rameters be estimated as in (2.19). Theorem 2.2 predicts that the UPS per- 
forms well provided that q G (max{?9, 5q(1 + r]) 2 r}, + r) 2 /(4r)), so both 



Table 3 

Ratios between Hamming errors and pe p (Experiment 2a.-2c). Bold: UPS. Plain: lasso 
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Fig. 4. Experiment 3a,. x-axis: q. y-axis: Hamming error. Left to right: •& = 0.2, 0.5, 0.65. 

the lasso and the UPS are driven by one tuning parameter q. We now in- 
vestigate how the choice of q affects the performances of the UPS and the 
lasso. The experiment contains three sub-experiments 3a-3c. 

In 3a, we use Stein's normal means model where (p, r) = (10 4 , 3), 7r p = v T 
with t p = y / 2r logp, is the penta-diagonal matrix satisfying = l{i=j\ + 



0.45-1 



{\i-3\=l} 



+ 0.05- l{|i_j|=2} ) and G {0.2,0.5,0.65}. Note that whentf: 



0.65, (max{i?,5§(l + r ? ) 2 r},(i? + r) 2 /(4r)) = (0.65,1) (similarly for other &), 
so we let q G {0.7, 0.8, . . . , 1.1}. 

In 3b, we use a random design model where (p,r,ir p ,£l,q) and the tuning 
parameters are the same as in 3a, but 9 = 0.8 and •& G {0.5,0.65} (the case 
$ = 0.2 is relatively challenging in computation so is omitted). We compare 
the lasso with the refined UPS where in each iteration, we use the same 
tuning parameters as in 3a. 

In 3c, we use the same setup as in 3b, except that we fix q = 1 and let r p 
range in {6, 6.5, . . . , 9}. 

The results of 3a-3c are reported in Figures 4-6, correspondingly. These 
results suggest that, first, the UPS consistently outperforms the lasso, and, 
second, the UPS is relatively less sensitive to different choices of q. 



• - lasso 

-e — ups 
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Fig. 5. Experiment 3b. x-axis: q. y-axis: Hamming error. Left: $ — 0.5. Right: ■& — 0.65. 
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FlG. 6. Experiment 3c. The x-axis is t p , and the y-axis is the ratio between the Hamming 
error and pe p . Left to right: •& = 0.65, 0.5, 0.2. 



Experiment 4. In this experiment, we investigate the effect of larger p 
and n, respectively. The experiment includes two sub-experiments, 4a and 4b. 

In 4a, we use Stein's normal means model where (i?,r) = (0.5,3), O as 
in Experiment 2c, ir p = v T with T p = y/2r logp, and we let p = 100 x {1, 10, 
10 2 , 10 3 , 10 4 }. The lasso and the UPS are implemented as in Experiment 3a, 
where q = l. The results are reported in the left part of Table 4, where the 
second line displays the ratios between the Hamming errors by the lasso 
and that by the UPS. Theoretic results (Sections 1.7 and 4) predict that for 
($, r) in the nonoptimal region of the lasso, such ratios diverge as p tends 
to oo. The numerical results fit well with the theory. 

In 4b, we illustrate that in a random design model, if we fix p and let n 
increase, then the random design models get increasingly close to Stein's nor- 
mal means model. In detail, we take a random design model where (p, r) = 
(10 4 ,0.5,3), tt and ir p as in Experiment 2c and n p = 300 x {1, 3, 3 2 , 3 3 , 3 4 }. 
We also take Stein's normal means model with the same (p, r, f2, ir p ). The 
performance of the UPS in both models is reported in the right part of Ta- 
ble 4, where the last line is the ratio between the Hamming errors by the 
UPS for the random design model and that for the Stein's normal means 
model. The ratios effectively converge to 1 as n increases. 



Table 4 

Left: ratios between the Hamming errors by the UPS and that by the lasso 
(Experiment Right: ratios between the Hamming errors by the UPS for the random 
design model and that for Stein's normal means model (Experiment J(b) 
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10 2 


10 3 


10 4 


10 5 


10 6 


300 


900 


2,700 


8,100 


24,000 


2.43 


5.81 


6.25 


8.80 


10.37 


479.25 


54.04 


12.66 


1.08 


1.01 
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SUPPLEMENTARY MATERIAL 

Supplementary material for "UPS delivers optimal phase diagram in 
high-dimensional variable selection" (DOI: 10.1214/11-AOS947SUPP; .pdf). 
Owing to space constraints, the technical proofs are moved to a supplemen- 
tary document [18]. 
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